Topic: Why can the PCA algorithm on the NanoRam 785 differentiate between calcium nitrate and magnesium nitrate, but the RVM algorithm on the NanoRam 1064 cannot?
Principal component analysis (PCA) is a procedure that transforms a set of variables into a new set of variables with no collinearity, called principal components. Dimensionality reduction techniques such as PCA are often used in spectral classification and prediction because of the large number of variables present in each spectrum. Each data point in a spectrum is treated as a variable, and reducing the number of variables is crucial to building a meaningful model. PCA uses an approach called feature extraction, which creates new features from recombinations of the existing variables so that the dataset can be described adequately with a smaller number of variables.
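To make the idea of feature extraction concrete, here is a minimal sketch of PCA on only two variables, where the rotation onto the principal axes has a closed form. The data values and the helper names `pca_2d` and `covariance` are hypothetical and are not taken from the analysis in this report; real spectra have hundreds of variables and would be handled by a library routine.

```python
import math

def pca_2d(rows):
    """Rotate centered two-variable data onto its principal axes."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    centered = [(x - mean_x, y - mean_y) for x, y in rows]
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Angle of the direction of largest variance (first principal axis).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # Scores: coordinates of each observation along PC1 and PC2.
    return [(c * x + s * y, -s * x + c * y) for x, y in centered]

def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (n - 1)

# Hypothetical intensities at two neighbouring pixels across six scans.
data = [(10.0, 8.1), (12.0, 9.7), (9.0, 7.0),
        (11.0, 9.1), (13.0, 10.4), (8.0, 6.5)]
scores = pca_2d(data)
pc1 = [p for p, _ in scores]
pc2 = [q for _, q in scores]

var1, var2 = covariance(pc1, pc1), covariance(pc2, pc2)
print("covariance of the original variables:", covariance(*zip(*data)))
print("covariance of PC1 and PC2:", covariance(pc1, pc2))
print("share of total variance on PC1:", var1 / (var1 + var2))
```

The two original variables are strongly collinear, while the two score columns have essentially zero covariance, and most of the variance ends up on PC1. That is the property that lets a few components stand in for many pixels.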
First, methods containing 20 scans each were built using the B&W Tek NanoRam 785 handheld Raman analyzer for a single sample each of magnesium nitrate and calcium nitrate. Then the same nitrate samples were measured using a B&W Tek NanoRam 1064 handheld Raman analyzer to create a five-scan method for each sample. The raw data was processed with dark subtraction, scaling, and centering. Running PCA on the processed 785 nm spectral data yields the following:
Scree plots show the eigenvalues of the principal components. The first one depicts the raw values of the variance on the y axis and the principal components in descending order by eigenvalue magnitude:
The second plot depicts the percentage of the total variance explained by each principal component, again in descending order of eigenvalue magnitude:
The information in the preceding graphs is also displayed numerically in the table below:
## Importance of first k=5 (out of 40) components:
##                            PC1     PC2     PC3     PC4    PC5
## Standard deviation     22.4054 4.68831 3.55776 2.43046 2.2113
## Proportion of Variance  0.8626 0.03777 0.02175 0.01015 0.0084
## Cumulative Proportion   0.8626 0.90031 0.92206 0.93221 0.9406
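The three rows of the table are linked: the variance of a component is the square of its standard deviation, the proportion is that variance divided by the total variance, and the cumulative row is the running sum of the proportions. The short sketch below re-derives the last two rows from the first. It relies on one assumption: the total variance is inferred from the PC1 entries, since the standard deviations of the remaining 35 components are not listed.

```python
# Values copied from the summary table above (first five of 40 components).
sdev = [22.4054, 4.68831, 3.55776, 2.43046, 2.2113]
reported_proportion = [0.8626, 0.03777, 0.02175, 0.01015, 0.0084]
reported_cumulative = [0.8626, 0.90031, 0.92206, 0.93221, 0.9406]

# Assumption: total variance inferred from PC1, because the other
# 35 standard deviations are not shown in the report.
total_variance = sdev[0] ** 2 / reported_proportion[0]

# Proportion of variance = squared standard deviation / total variance.
proportion = [s ** 2 / total_variance for s in sdev]

# Cumulative proportion = running sum of the proportions.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

for k, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{k}: proportion={p:.5f}  cumulative={c:.5f}")
```

The recomputed values agree with the table to within rounding, so the scree plots and the table describe the same eigenvalues.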
A 95% confidence ellipse was constructed for each class. Under repeated sampling from the underlying distribution, with a confidence ellipse calculated each time, 95% of the constructed ellipses would contain the true mean of the distribution.
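As an illustration of that definition, the sketch below computes a 95% confidence ellipse for the mean of a two-dimensional sample using Hotelling's T² statistic. Whether the ellipses in the plots were built this way is not stated in this report, so this is an assumption about one standard construction, not a description of the plotting code. The score values and function names are hypothetical. For two dimensions the required F quantile has a closed form, so no statistics library is needed.

```python
import math

def f_quantile_2(m, level=0.95):
    """Quantile of the F distribution with (2, m) degrees of freedom.

    Closed form, valid only when the numerator degrees of freedom equal 2.
    """
    return (m / 2.0) * ((1.0 - level) ** (-2.0 / m) - 1.0)

def mean_confidence_ellipse(points, level=0.95):
    """Center, semi-axes and rotation of a confidence ellipse for the mean."""
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points")
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Eigenvalues of the 2x2 sample covariance matrix.
    half_trace = (sxx + syy) / 2.0
    radius = math.hypot((sxx - syy) / 2.0, sxy)
    lam_major = half_trace + radius
    lam_minor = max(half_trace - radius, 0.0)
    # Orientation of the major axis.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    # Hotelling scaling for p = 2 dimensions: p(n-1) / (n(n-p)) * F(p, n-p).
    scale = 2.0 * (n - 1) / (n * (n - 2)) * f_quantile_2(n - 2, level)
    return (mx, my), math.sqrt(lam_major * scale), math.sqrt(lam_minor * scale), theta

# Hypothetical PC1/PC2 scores for six scans of one class.
points = [(1.0, 0.5), (1.4, 0.9), (0.8, 0.2), (1.2, 0.8), (1.1, 0.4), (0.9, 0.6)]
center, semi_major, semi_minor, theta = mean_confidence_ellipse(points)
print("center:", center)
print("semi-axes:", semi_major, semi_minor)
print("rotation (radians):", theta)
```

Note that this ellipse shrinks as the number of scans grows, because it describes uncertainty about the class mean. An ellipse meant to enclose 95% of individual scans would be larger and would not shrink in the same way.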
The calcium nitrate samples had low in-class variance, but the magnesium nitrate samples had a few massive outliers that significantly raised the variance and heavily affected the PC construction. Furthermore, the between-class variance is fairly small: the two main clusters are right next to each other. Can we do better?
Scaling before running PCA is a seemingly contentious issue in the data science and spectroscopy literature. There is a general consensus that scaling is required when variables have different units or greatly differing standard deviations. However, the variables in spectral data are all relative intensities at individual pixels, so they share the same units and are measured on the same scale. When that is the case, it is acceptable, and potentially even advantageous, to leave the variables unscaled and perform the PCA on the covariance matrix. Otherwise, scaling the variables and performing the PCA on the correlation matrix makes more sense.
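The practical effect of that choice can be seen in how much of the total variance each variable contributes before and after scaling. The sketch below uses hypothetical intensities at three pixels, one on a Raman peak and two on the baseline; the numbers and function names are illustrative only.

```python
from statistics import mean, stdev, variance

# Hypothetical intensities: column 0 sits on a Raman peak,
# columns 1 and 2 are baseline pixels that carry mostly noise.
spectra = [
    [1000.0, 12.0, 9.0],
    [1100.0, 10.0, 11.0],
    [950.0, 11.0, 10.0],
    [1050.0, 13.0, 8.0],
    [900.0, 9.0, 12.0],
]

def variance_shares(rows):
    """Fraction of the total variance contributed by each variable."""
    variances = [variance(col) for col in zip(*rows)]
    total = sum(variances)
    return [v / total for v in variances]

def autoscale(rows):
    """Center each variable and divide it by its standard deviation."""
    columns = list(zip(*rows))
    scaled = [[(v - mean(col)) / stdev(col) for v in col] for col in columns]
    return [list(row) for row in zip(*scaled)]

unscaled_shares = variance_shares(spectra)
scaled_shares = variance_shares(autoscale(spectra))
print("unscaled (covariance-based) shares:", unscaled_shares)
print("scaled (correlation-based) shares: ", scaled_shares)
```

Without scaling, the peak pixel dominates the total variance, so the leading components follow the peak. After scaling, every pixel contributes equally, which gives baseline noise the same weight as the chemically informative peak.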
Outliers can also have a substantial impact on the construction of principal components. From part one, it was immediately obvious that an outlier in the magnesium nitrate group was skewing the results of the PCA. Examining the spectra graphically in the spectroscopy software BWIQ shows that magnesium nitrate scan #12 is vastly different from the other magnesium nitrate scans, which was likely due to human error in taking the scan of the sample. Two other potential outliers appear in the PCA graph from part one, but spectrally they are not dissimilar to the other samples in the method, and after running PCA again with the largest outlier removed, those two scans are no longer problematic.
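The outlier in this report was identified by visual inspection. As a complement, a scan like #12 could also be screened for automatically. The sketch below is one possible approach, not the method used here: it measures each scan's distance from the pixel-wise median spectrum of its class and flags scans whose distance has a large modified z-score based on the median absolute deviation. The data, the function name, and the threshold of 3.5 are assumptions chosen for illustration.

```python
import math
from statistics import median

def flag_outlier_scans(scans, threshold=3.5):
    """Return indices of scans that lie far from the class median spectrum."""
    # Pixel-wise median spectrum of the class.
    median_spectrum = [median(col) for col in zip(*scans)]
    # Euclidean distance of each scan from that median spectrum.
    distances = [math.dist(scan, median_spectrum) for scan in scans]
    center = median(distances)
    mad = median([abs(d - center) for d in distances])
    if mad == 0:
        return []  # degenerate case: the distances cannot be scaled
    # Modified z-score; 0.6745 makes the MAD comparable to a standard deviation.
    return [i for i, d in enumerate(distances)
            if 0.6745 * (d - center) / mad > threshold]

# Hypothetical four-pixel scans; the last one is deliberately deviant.
scans = [
    [10.0, 50.0, 10.0, 5.0],
    [11.0, 52.0, 9.0, 5.0],
    [10.0, 49.0, 10.0, 6.0],
    [9.0, 51.0, 11.0, 5.0],
    [10.0, 50.0, 10.0, 4.0],
    [30.0, 20.0, 25.0, 15.0],
]
flagged = flag_outlier_scans(scans)
print("flagged scan indices (zero-based):", flagged)
```

Median-based statistics are used because a single extreme scan inflates the mean and standard deviation enough to partly hide itself. Any flagged scan should still be inspected spectrally before it is removed.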
First, methods containing 20 scans each were built using the B&W Tek NanoRam 785 handheld Raman analyzer for a single sample each of magnesium nitrate and calcium nitrate. The raw data was processed with dark subtraction and centering, but without scaling. Magnesium nitrate scan #12 was removed from the dataset as an outlier. Running PCA on the processed 785 nm spectral data yields the following:
## Importance of first k=5 (out of 39) components:
##                              PC1       PC2       PC3       PC4       PC5
## Standard deviation     7784.6314 4094.0216 1.466e+03 730.90832 621.10275
## Proportion of Variance    0.7284    0.2015 2.582e-02   0.00642   0.00464
## Cumulative Proportion     0.7284    0.9299 9.557e-01   0.96215   0.96678
Reduced Variable Multivariate (RVM) is another dimensionality reduction procedure. It takes a set of variables and deletes variables until only a small subset that accurately describes the dataset remains. RVM uses an approach called feature selection, which differs from feature extraction in a couple of ways. First, since variables are removed rather than transformed into new ones, feature selection discards data, whereas in feature extraction every original variable still contributes to the new features. Second, feature selection gives no guarantee that the remaining variables are free of collinearity, while feature extraction guarantees orthogonal variables with no collinearity. Before the data collected on the 1064 nm device is analyzed with RVM, it is first run through the PCA model as a baseline for comparison.
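The collinearity point can be shown with a toy feature selection step. The sketch below keeps the two pixels with the highest variance and then checks how correlated they are. This variance ranking is only a stand-in for illustration: it is not the RVM algorithm, whose selection rule is not documented in this report. The data and function names are hypothetical.

```python
import math

def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return num / den

def select_top_variance(rows, k):
    """Indices of the k variables with the largest variance (toy selection rule)."""
    columns = list(zip(*rows))
    ranked = sorted(range(len(columns)),
                    key=lambda j: sample_variance(columns[j]), reverse=True)
    return sorted(ranked[:k])

# Hypothetical scans: pixels 0 and 1 sit on the same peak, 2 and 3 on the baseline.
scans = [
    [100.0, 80.0, 5.0, 3.0],
    [120.0, 97.0, 6.0, 3.0],
    [90.0, 71.0, 5.0, 4.0],
    [110.0, 89.0, 4.0, 3.0],
    [130.0, 104.0, 5.0, 2.0],
]
kept = select_top_variance(scans, k=2)
columns = list(zip(*scans))
r = pearson(columns[kept[0]], columns[kept[1]])
print("kept pixel indices:", kept)
print("correlation between the kept pixels:", r)
```

Both kept pixels lie on the same peak, so they are almost perfectly correlated and the second one adds little new information. Principal component scores, by contrast, are uncorrelated by construction.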
There don’t seem to be any outliers present and, following the lesson of the previous sections, the variables are not scaled in the preprocessing steps.
## Importance of first k=3 (out of 12) components:
##                              PC1       PC2       PC3
## Standard deviation     3.019e+04 1.356e+04 3.713e+03
## Proportion of Variance 8.159e-01 1.645e-01 1.234e-02
## Cumulative Proportion  8.159e-01 9.803e-01 9.927e-01
Note: the confidence ellipses are currently not being drawn on this plot; this error still needs to be fixed.
The PCA algorithm has no trouble distinguishing between the two nitrates on the data from the 1064 nm device either. This points to a flaw in the RVM algorithm, since the RVM method had specificity issues when examining the nitrate samples on the 1064 nm device. Further investigation will be conducted to uncover the issues the 1064 nm device is having with its current algorithm.
I’m going to try to build an RVM algorithm from scratch, since everything is so top secret here. Let’s see how it turns out. All I have to go on is that RVM uses feature selection, so there is probably collinearity between the remaining variables, and that a variable inflation factor is applied, which definitely causes a decrease in specificity (more false positives). There appears to be no way to alter the default inflation factor of 5 on the device, although the internal paper mentions that it can be lowered to deal with false positive issues.
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] factoextra_1.0.6 ggfortify_0.4.8 data.table_1.12.2
## [4] plyr_1.8.4 forcats_0.4.0 stringr_1.4.0
## [7] dplyr_0.8.3 purrr_0.3.2 readr_1.3.1
## [10] tidyr_1.0.0 tibble_2.1.3 tidyverse_1.2.1
## [13] plotly_4.9.1 ggplot2_3.2.1
##
## loaded via a namespace (and not attached):
## [1] ggrepel_0.8.1 Rcpp_1.0.2 lubridate_1.7.4
## [4] lattice_0.20-38 assertthat_0.2.1 zeallot_0.1.0
## [7] digest_0.6.21 mime_0.7 R6_2.4.0
## [10] cellranger_1.1.0 backports_1.1.4 evaluate_0.14
## [13] httr_1.4.1 pillar_1.4.2 rlang_0.4.0
## [16] lazyeval_0.2.2 readxl_1.3.1 rstudioapi_0.10
## [19] rmarkdown_1.15 labeling_0.3 htmlwidgets_1.3
## [22] munsell_0.5.0 shiny_1.3.2 broom_0.5.2
## [25] compiler_3.6.1 httpuv_1.5.2 modelr_0.1.5
## [28] xfun_0.9 pkgconfig_2.0.3 htmltools_0.3.6
## [31] tidyselect_0.2.5 gridExtra_2.3 viridisLite_0.3.0
## [34] crayon_1.3.4 withr_2.1.2 ggpubr_0.2.4
## [37] later_0.8.0 grid_3.6.1 xtable_1.8-4
## [40] nlme_3.1-140 jsonlite_1.6 gtable_0.3.0
## [43] lifecycle_0.1.0 magrittr_1.5 scales_1.0.0
## [46] cli_1.1.0 stringi_1.4.3 ggsignif_0.6.0
## [49] promises_1.0.1 xml2_1.2.2 generics_0.0.2
## [52] vctrs_0.2.0 tools_3.6.1 glue_1.3.1
## [55] hms_0.5.1 crosstalk_1.0.0 yaml_2.2.0
## [58] colorspace_1.4-1 rvest_0.3.4 knitr_1.25
## [61] haven_2.1.1